Conversation

wli51 (Collaborator) commented Sep 22, 2025

Summary

This PR adds a preprocessing notebook for the DepMap PRISM secondary drug repurposing dataset, producing a clean, deduplicated table of drug–cell line IC50 values for downstream agentic experiments.

Key Changes

  1. Config file + template

    • Introduces a gitignored config.yml for global configuration, including the location of the downloaded PRISM data and API keys for language models. A config.yml.template is checked in in place of the ignored config.yml.
  2. Notebook for deduplicating, merging screens, and generating summary visualizations of the dataset

    • Resolves duplicate drug–cell line pairs within each screen (HTS002, MTS010).
    • Prioritizes entries with the highest curve-fit quality (r²); see the sketch after this list.
    • Tabulates dataset composition by tissue and cell line.
  3. Trivial pytest to ensure the script version of the notebook runs.
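A minimal sketch of the deduplication idea in pandas (the file and column names here are illustrative assumptions, not necessarily those used in the notebook):

    import pandas as pd

    # hypothetical input file and column names
    df = pd.read_csv("prism_secondary_screen.csv")

    # keep the best-fitting curve (highest r2) per screen / drug / cell line pair
    deduped = (
        df.sort_values("r2", ascending=False)
        .drop_duplicates(subset=["screen_id", "drug_id", "cell_line_id"], keep="first")
    )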


Comment on lines 38 to 47
try:
    from IPython import get_ipython
    shell = get_ipython().__class__.__name__
    if shell == 'ZMQInteractiveShell':
        print("Running in Jupyter Notebook")
        IN_NOTEBOOK = True
    else:
        print("Running in IPython shell")
except NameError:
    print("Running in standard Python shell")
Member

This looks like a handy utility function in the making. Consider elevating this to be reusable across other notebooks without duplication.
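For example, a minimal sketch of such a helper (the module name and placement are assumptions):

    # e.g. utils/environment.py (hypothetical location)
    def in_notebook() -> bool:
        """Return True when running inside a Jupyter notebook kernel."""
        try:
            from IPython import get_ipython
            return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
        except (NameError, ImportError):
            return False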

Collaborator Author

Fixed!

Comment on lines 54 to 57
git_root = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"], text=True
).strip()
config_path = pathlib.Path(git_root) / "config.yml"
Member

To avoid a subprocess call (which can become a challenge), consider a pattern whereby the Jupyter notebook has access to the project root by default. This can be configured with a settings.json file if you run notebooks in VS Code. Alternatively, if using the Jupyter web interface, run Jupyter from the root of the project repo and navigate to the notebook.
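For example, with the VS Code Jupyter extension, a settings.json entry along these lines pins the notebook working directory to the workspace root (a sketch, assuming a workspace-level .vscode/settings.json):

    {
        "jupyter.notebookFileRoot": "${workspaceFolder}"
    }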

Collaborator Author

Configured the default working directory of notebooks to be the project root and dropped the subprocess call!

Comment on lines +59 to +60
if not config_path.exists():
    raise FileNotFoundError(f"Config file not found at: {config_path}")
Member

Consider making use of pathlib.Path.resolve(strict=True), possibly embedded in the variable assignment above.
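A sketch of what that could look like folded into the assignment (same behavior: raises FileNotFoundError if the file is missing):

    config_path = (pathlib.Path(git_root) / "config.yml").resolve(strict=True)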

Comment on lines +67 to +68
data_cfg = config.get("data")
if not data_cfg:
Member

Consider the use of the walrus operator here to help check and assign at the same time.

Suggested change
-data_cfg = config.get("data")
-if not data_cfg:
+if not (data_cfg := config.get("data")):

Comment on lines +77 to +82
for key in required_keys:
    value = data_cfg.get(key)
    if value is None:
        results.append((key, None, "Missing in config"))
        errors.append(f"Config key '{key}' is missing")
        continue
Member

Seeing these checks made me wonder if it might make sense to use YAML schema validation. It's common to use jsonschema for this because YAML is a superset of JSON. If you move in this direction, it takes a lot of the guesswork out of validating whether you have an object of a certain structure in YAML or JSON.
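A minimal sketch of that approach (the schema contents here are illustrative assumptions, not the notebook's actual keys):

    import jsonschema
    import yaml

    # hypothetical schema describing the expected config structure
    CONFIG_SCHEMA = {
        "type": "object",
        "properties": {
            "data": {
                "type": "object",
                "properties": {"prism_dir": {"type": "string"}},
                "required": ["prism_dir"],
            },
        },
        "required": ["data"],
    }

    with open("config.yml") as f:
        config = yaml.safe_load(f)

    # raises jsonschema.ValidationError with a descriptive message on mismatch
    jsonschema.validate(instance=config, schema=CONFIG_SCHEMA)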

Collaborator Author

Fixed using JSON Schema!

label.set_rotation(90)

plt.tight_layout()
plt.show()
Member

Would it be possible to save the plots? It might make comparisons clearer as things proceed. This might mean making the notebook check more flexible.
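A sketch of one way to do this, reusing the IN_NOTEBOOK flag from earlier (output directory and filename are hypothetical):

    import pathlib
    import matplotlib.pyplot as plt

    fig_dir = pathlib.Path("figures")
    fig_dir.mkdir(parents=True, exist_ok=True)
    plt.tight_layout()
    plt.savefig(fig_dir / "dataset_composition.png", dpi=300, bbox_inches="tight")
    if IN_NOTEBOOK:
        plt.show()
    else:
        plt.close()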

Collaborator Author

Made saving plots the default in both notebook and script mode; showing the plot is only skipped in script mode!

Comment on lines +6 to +8
api:
  openai:
    key: "YOUR_OPENAI_API_KEY"
Member

Depending on how this key is loaded, consider using python-dotenv to help keep the key out of a file checked into source control. Otherwise, or maybe either way, be sure to gitignore this file.
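A minimal sketch of the python-dotenv pattern (the environment variable name is an assumption):

    # pip install python-dotenv; keep the key in a gitignored .env file
    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads variables from .env into the process environment
    openai_key = os.environ["OPENAI_API_KEY"]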

Collaborator Author

Will decide whether to adopt this change in the future if more environment variables become necessary.

Comment on lines 15 to 23
script_path = repo_root / "analysis" / "0.data_wrangling" / "nbconverted" / "0.1.wrangle_depmap_prism_data.py"

# Run from repo root so `git rev-parse --show-toplevel` and config.yml resolve
result = subprocess.run(
    ["python", str(script_path)],
    cwd=repo_root,
    capture_output=True,
    text=True,
    env={**os.environ},  # inherit env
Member

Instead of using a subprocess to run this as a script, consider using Pythonic imports. If you install the work as part of the venv, you can use something like from analysis import wrangle. This might change the way you name or store things but would make the work clearer, more Pythonic, and likely faster (importing the code directly is faster than waiting for a new Python interpreter call to complete).
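A sketch of what the test could look like under that approach (module and function names are hypothetical; assumes the package is installed into the venv, e.g. via pip install -e .):

    from analysis.data_wrangling import wrangle_depmap_prism

    def test_wrangle_depmap_prism_runs():
        # importing and calling directly avoids spawning a second interpreter
        wrangle_depmap_prism()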

/data/processed/processed_depmap_prism_ic50.csv

# Actual config.yml
config.yml
Member

Consider adding this file's details to the README.

wli51 (Collaborator Author) commented Oct 2, 2025

Thanks for reviewing, Dave. Merging!

wli51 merged commit c40a11d into WayScience:main Oct 2, 2025